Fix IndexAuditTrail rolling upgrade on rollover edge - take 2 #38286
Pinging @elastic/es-security
@@ -337,7 +349,7 @@ public void onResponse(ClusterStateResponse clusterStateResponse) {
    updateCurrentIndexMappingsIfNecessary(clusterStateResponse.getState());
} else if (TemplateUtils.checkTemplateExistsAndVersionMatches(INDEX_TEMPLATE_NAME,
        SECURITY_VERSION_STRING, clusterStateResponse.getState(), logger,
        Version.CURRENT::onOrAfter) == false) {
        Version.CURRENT::onOrBefore) == false) {
        transitionStartingToInitialized();
    }
} else {
    @SuppressWarnings("unchecked")
    Map<String, Object> meta = (Map<String, Object>) docMapping.sourceAsMap().get("_meta");
    if (meta == null) {
        logger.info("Missing _meta field in mapping [{}] of index [{}]", docMapping.type(), index);
        throw new IllegalStateException("Cannot read security-version string in index " + index);
    }
Fixes #37062 (comment).
A non-master node detects an un-updated audit index and bails. Instead, it should hold off and retry. The index is un-updated because the master updated the mapping of the index from before the rollover edge ("the race": the template upgrade happened after the rollover edge, but audit events on the master came before it).
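The decision described in this comment can be sketched as a small pure function. This is only an illustration under stated assumptions: the class, enum, and method names are hypothetical and are not taken from `IndexAuditTrail`.

```java
public class AuditStartDecisionSketch {

    /** Hypothetical outcomes for a node that is starting the index audit trail. */
    public enum Action { PROCEED, UPDATE_MAPPING, RETRY_LATER }

    /**
     * Only the master updates mappings. A non-master node that sees an
     * un-updated audit index holds off and retries later instead of bailing.
     */
    public static Action decide(boolean isMaster, boolean mappingUpToDate) {
        if (mappingUpToDate) {
            return Action.PROCEED;
        }
        return isMaster ? Action.UPDATE_MAPPING : Action.RETRY_LATER;
    }
}
```

The point of the sketch is that an outdated mapping is no longer a terminal condition for a non-master node.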
    innerStart();
}, e2 -> {
    // best effort only
    logger.debug("Failed to update mappings on next audit index [{}]", nextIndex, e2);
The master tries to update the mapping for the next rollover index, just in case.
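As a rough sketch of the idea, assuming a daily rollover with a `yyyy.MM.dd` date suffix, the next index name can be derived from the current date and the update attempted without propagating failures. The class name, the `audit` prefix used in the usage note, and the `putMapping` callback are hypothetical stand-ins, not the PR's actual code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.function.Consumer;

public class NextAuditIndexSketch {

    // Assumed daily suffix format; the real audit trail supports other rollover intervals.
    private static final DateTimeFormatter DAILY =
            DateTimeFormatter.ofPattern("yyyy.MM.dd", Locale.ROOT);

    /** Builds the index name for a given day, for example "prefix-2019.02.01". */
    public static String dailyIndexName(String prefix, LocalDate day) {
        return prefix + "-" + DAILY.format(day);
    }

    /** Name of the index that the next daily rollover would create. */
    public static String nextDailyIndexName(String prefix, LocalDate today) {
        return dailyIndexName(prefix, today.plusDays(1));
    }

    /**
     * Best effort only: runs the (hypothetical) mapping update and reports
     * whether it succeeded, swallowing runtime failures such as a missing index.
     */
    public static boolean updateBestEffort(String index, Consumer<String> putMapping) {
        try {
            putMapping.accept(index);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

With these assumptions, `nextDailyIndexName("audit", LocalDate.of(2019, 1, 31))` yields `audit-2019.02.01`. A failed update on the next index is ignored because that index may simply not exist yet.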
@@ -217,6 +218,7 @@ subprojects {
    setting 'xpack.security.enabled', 'true'
    setting 'xpack.security.transport.ssl.enabled', 'true'
    setting 'xpack.security.transport.ssl.keystore.path', 'testnode.jks'
    setting 'logger.org.elasticsearch.xpack.security.audit.index', 'DEBUG'
This should help diagnose possible future failures!
public void testAuditLogs() throws Exception {
    assertBusy(() -> {
        assertAuditDocsExist();
        assertNumUniqueNodeNameBuckets(expectedNumUniqueNodeNameBuckets());
    });
    }, 30, TimeUnit.SECONDS);
Allows some slack for the old nodes to create and allocate a new audit index while the master is down for upgrade.
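To illustrate what an explicit timeout buys, here is a standalone sketch of a time-bounded retrying assertion. It is not the test framework's `assertBusy` implementation; the class name and the backoff values are assumptions chosen for the example.

```java
import java.util.concurrent.TimeUnit;

public class BusyAssertSketch {

    /**
     * Re-runs the assertion until it passes or the time budget is spent.
     * The last AssertionError is rethrown once the deadline has passed.
     */
    public static void assertBusy(Runnable assertion, long maxWait, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // the condition finally holds
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the most recent failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            // Exponential backoff, capped so the condition is still polled regularly.
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }
}
```

A larger budget such as 30 seconds does not slow down the passing case, since the loop returns as soon as the assertion succeeds.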
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/7058/console
@elasticmachine run elasticsearch-ci/2
LGTM
Fixes a race during a rolling upgrade with the index audit output enabled. The race is that after the upgraded node is restarted, it installs the audit template and updates the mapping of the "current" (from its perspective) audit index. But the template might be installed after a new daily rolled-over index has been created by the other, old nodes using the old templates. However, the new node, even if it installs the template after the rollover edge, can accumulate audit events from before the edge, and will correctly try to update the mapping of the audit index from before the edge. But this way, the mapping of the index after the edge remains un-updated, because only the master node does the mapping updates. The fix keeps the design of only allowing the master to update the mapping, but the master will also try, on a best-effort basis, to update the mapping of the next rollover audit index.
Fixes a race during the rolling upgrade with the index audit output enabled.
The race is that after the upgraded node is restarted, it installs the audit template and updates the mapping of the "current" (from its perspective) audit index. But the template might be installed after a new daily rolled-over index has been created by the other, old nodes using the old templates.
However, the new node, even if it installs the template after the rollover edge, can accumulate audit events from before the edge, and will correctly try to update the mapping of the audit index from before the edge. But this way, the mapping of the index after the edge remains un-updated, because only the master node does the mapping updates.
The fix keeps the design of only allowing the master to update the mapping, but the master will also try, on a best-effort basis, to update the mapping of the next rollover audit index.
This may be a shot in the dark, because I no longer have access to the failure data, but I think the breadcrumbs point in this direction. Moreover, turning up debug logging will make future failures easier to diagnose.
Relates #35988
Closes #33867 #37062